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Bayesian Nonparametric Shrinkage 
Applied to Cepheid Star Oscillations 

James Berger, William H. Jefferys and Peter MiJIIer 



Abstract. Bayesian nonparametric regression with dependent wavelets 
has dual shrinkage properties: there is shrinkage through a dependent 
prior put on functional differences, and shrinkage through the setting 
of most of the wavelet coefficients to zero through Bayesian variable 
selection methods. The methodology can deal with unequally spaced 
data and is efficient because of the existence of fast moves in model 
space for the MCMC computation. 

The methodology is illustrated on the problem of modeling the os- 
cillations of Cepheid variable stars; these are a class of pulsating vari- 
able stars with the useful property that their periods of variability are 
strongly correlated with their absolute luminosity. Once this relation- 
ship has been calibrated, knowledge of the period gives knowledge of 
the luminosity. This makes these stars useful as "standard candles" for 
estimating distances in the universe. 

Key words and phrases: Nonparametric regression, wavelets, shrink- 
age prior, sparsity, variable selection methods. 



1. INTRODUCTION 

1.1 Nonparametric Bayesian Shrinkage 

Bayesian analysis has long been a major method- 
ological vehicle for implementation of shrinkage ideas 
in complex scenarios. There are two primary ways 
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in which such shrinkage is implemented. The first 
is through use of prior distributions which shrink 
the unknowns in some fashion — to prespecified lo- 
cations or prespecified subspaces, depending on the 
problem and type of prior. Thus an unknown nor- 
mal mean could be shrunk toward a specified prior 
mean; a collection of unknown normal means could 
be shrunk toward the hyperplane in which the means 
are equal; and an unknown real function could be 
shrunk toward the subspace of monotonic functions. 
This is the Bayesian version of the type of shrinkage 
originating with Stein (1956) and James and Stein 
(1961). 

The second major Bayesian vehicle for shrinkage 
is Bayesian variable selection, which sets some of 
the unknown parameters to zero. This is often an 
overly drastic shrinkage, but is certainly not so in 
the context of model selection, or in the context of 
nonparametric function estimation. In the latter set- 
ting, the unknown parameters that are set to zero 
are typically coefficients of basis elements from a ba- 
sis representation of the function, and sparsity con- 
siderations strongly encourage such shrinkage. 
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Fig. 1. The radial velocity data (the x 's) for T Mon, and their fit to a fifth-order trigonometric polynomial. 



Both of these shrinkage concepts are herein uti- 
lized in nonparametric function estimation with de- 
pendent wavelets. The motivating application is to 
Cepheid variable stars and is described in the next 
subsection; the functions to be estimated can have 
arbitrary shapes, but are quite smooth. It is to in- 
duce sufficient smoothness that will utilize both ty- 
pes of shrinkage discussed above. 

1.2 The Astronomical Problem 

There is a class of stars, called Cepheid variables, 
that pulsate with a regular and distinctive periodic 
signature. The stars actually grow larger and then 
smaller, and as a result their luminosities vary peri- 
odically along with their colors. Since there is a phys- 
ical relationship between the star's linear diameter, 
its luminosity, and its color, there are actually two 
independent periodically varying quantities. 

A very interesting and useful property of these 
stars is that their mean luminosities are highly corre- 
lated with their pulsation period, in that the shorter- 
period stars are less luminous than the longer-period 
ones. This is very well approximated as a linear re- 
lation between the log of the period and the log of 
the luminosity. As a consequence, if one knows the 
slope and intercept of this relationship, and mea- 
sures the period of a Cepheid (which is trivial), one 
can infer the luminosity with quite high precision. 
This makes these stars very useful as "standard can- 
dles," because knowledge of a star's luminosity as 
well as its observed brightness allows us to compute 
the distance from the inverse square law. Knowing 
the distance to the individual Cepheid also gives us 



the distance to the galaxy or cluster of stars in which 
it is embedded. Thus, these stars are fundamental 
in setting the distance scale of the universe. 

The most challenging feature of the problem sta- 
tistically is that the key photometry and radial ve- 
locity curves for a star are unknown, and have no 
simple structure. In Barnes et al. (2003), Fourier 
polynomials of finite (but unknown) degree were 
used to represent these two curves. For instance, 
Figure 1 presents the data concerning the radial ve- 
locity of the surface of the star T Moncerotis, at 
various phases of the star's period (the actual data 
are indicated by the x 's) together with a fifth-order 
trigonometric polynomial fit to the data. Because 
of the possibility of quite arbitrary shapes for the 
photometry and velocity curves for Cepheid vari- 
able stars, we instead desired to model the curves 
via much more flexible wavelet decompositions. 

1.3 Computational Implementation 

Posterior inference in this setup is formally equiv- 
alent to variable selection in a normal linear regres- 
sion problem with massively many candidate covari- 
ates. Posterior simulation requires averaging and/or 
selection across alternative models defined by the 
set of basis functions (wavelets) which are included 
in the model. In the context of normal-linear re- 
gression, common approaches are guided search in 
the model space using the Occam's Window princi- 
ple ((Madigan and Raftery, 1994); (Raftery, Madi- 
gan and Hoeting, 1997)); Markov chain Monte Carlo 
simulation across the model space ((George and Mc- 
Culloch, 1997); (Smith and Kohn, 1996)); and im- 
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portance sampling or Gibbs sampling based on an- 
alytic approximations to the marginal posterior dis- 
tribution on the model indicator ((Clyde, DeSimone 
and Parmigiani, 1996); (Clyde, Parmigiani and Vi- 
dakovic, 1998)). See, for example, Clyde (1999), Hoe- 
ting et al. (1999) and Clyde and George (2004) for 
reviews. In this paper we introduce a scheme for 
fast posterior simulation across the model space, 
marginalizing over the wavelet coefficients. We use 
a computational strategy similar to that used by 
George and McCulloch (1997) and Smith and Kohn 
(1996) to allow fast computation of marginal model 
probabilities when considering models differing by 
only one wavelet basis function. 

2. WAVELET REPRESENTATION 

Wavelet decomposition allows representation of 
any square integrable function f{x) as 
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Here Vjfc(x) = 2^/^il;{2^x - k) and (j)jk{x) = 2^/^ . 
(j){2^x — k) are wavelets and scaling functions at level 
of detail j and shift k. In the context of statisti- 
cal modeling, (1) allows for inference about random 
functions by defining a probability model for the 
coefficients 6 = [cjf^kidjk-, j ^ Jo', k G Z), that is, 
(1) provides a parameterization of a random func- 
tion / in terms of the wavelet coefficients 9. See, 
for example, Vidakovic and Miiller (1999) or Fer- 
reira and Lee (2007), Chapter 5, for a review of 
wavelet representations relevant for statistical mod- 
eling. 

Perhaps the most common application of (1) in sta- 
tistical modeling is to nonlinear regression where f{x) 
represents the unknown mean response E{y\x) for 
an observation y with covariate x. Chipman, Ko- 
laczyk and McCulloch (1997), Clyde, Parmigiani and 
Vidakovic (1998), Vidakovic (1998), Semadeni, Davi- 
son and Hinkley (2004), Tadesse et al. (2005), Wang 
and Wood (2006), ter Braak (2006) and Abramovich, 
Angelini and De Canditiis (2007), among many oth- 
ers, discuss Bayesian inference in such models as- 
suming equally spaced data, that is, covariate val- 
ues Xi are on a regular grid. For equally spaced 
data the discrete wavelet transformation is orthogo- 
nal. Together with assuming independent measure- 
ment errors and a priori independent wavelet coef- 
ficients this leads to posterior independence of the 
djk- Thus the problem essentially reduces to a se- 
quence of univariate problems, one for each wavelet 



coefficient. See, for example, Yau and Kohn (1999) 
for a review. Generalizations of wavelet techniques 
to non-equidistant (NFS) design impose additional 
conceptual and computational burdens. A reason- 
able approximation is to bin observations in equally 
spaced bins and proceed as in the equally spaced 
case. If only few observations are missing to com- 
plete an equally spaced grid, treating these few as 
missing data leads to efficient implementations (An- 
toniadis, Gregoire and McKeague (1994); (Cai and 
Brown, 1998)). We propose instead an approach 
which does not depend on posterior independence. 
Our approach includes informative dependent pri- 
ors with positive prior probabilities for vanishing 
wavelet coefficients. 

3. SHRINKAGE OF f(x) 
3.1 Shrinkage Toward a Smooth Subspace 

Because of the wavelet representation that will be 
used, a function space prior can be defined by con- 
sidering the function at the discrete points {i/n,i = 
1, . . . ,n}, where n = 2^ . Letting /j = f{i/n), con- 
sider the difference process di = fi — fi-i. 

A function space prior that "shrinks toward 
smoothness" can be defined by imposing positive cor- 
relations on the di. Specifically, let d = (di, . . . , d^), 
and define the prior to be p{d) = N{0, A) with Ajj = 
Aexp(— /3|i — j|); that is, we assume a multivariate 
normal prior with scale parameter A and log corre- 
lations proportional to distance. 

Let A(]^^) denote the left upper (n — 1) x (n — 1) 
submatrix of A and partition A into 
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^21) ^(22). 

Let V = Var(X;r=i di) = A Y,7=i E"=i exp(-/3|z -j] 
Assuming /o ~ A^(0, Actq) we find 
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Fig. 2. For fi = 0.1, the left panel plots simulations from the prior process on the unknown function conditioning on all 
wavelet coefficients included; the right panel shows for comparison prior simulations conditional on setting those coefficients 
equal to zero which are excluded by the universal wavelet thresholding rule with y/2na of Donoho and Johnstone (1994). 



In view of the normalization property, 



Ojfcl 



1, 



scaling coefficients at the highest level of detail J 
are approximately proportional to the represented 
function, cjk ~ 2"'^'^/^. Therefore the multivariate 
normal prior on (/o, • • • , /n-i) implies p{cj) = N(Q, 
rj ■ XV) where rj = 2~ . Following common practice 
in the use of wavelet decomposition, we will ignore 
the proportionality constant rj and assume 

p{cj) = N{0,XV). 

As long as we also drop rj in the reconstruction 
of /(x), ignoring the proportionality constant will 
leave the final inference unchanged. 

The prior p{cj) = N{0,\V) implies a dependent 
multivariate normal prior for the vector of all wavelet 



coefficients 
2J - 1) 

(2) 



d 



{cjok,djk,j 



Jn 



, J , k — U, . . . , 



p(d|^ = l) = iV(o,AA). 



In principle A can be found by explicitly comput- 
ing the linear operator of the wavelet decomposi- 
tion. But from a computational point of view this is 
unnecessary and undesirable. Instead Vannucci and 
Corradi (1999) show how A can be derived from V 
as a bivariate wavelet decomposition of V . 

3.2 Shrinkage Through Wavelet Sparsity 

One of the important advantages of wavelet bases 
over alternative bases for L^ functions is the parsi- 
mony property of wavelet representations. Reason- 
ably regular functions are well approximated with 
only few nonzero wavelet coefficients. Therefore 
"shrinkage toward smoothness" can also be induced 



by setting many of the wavelet coefficients to be 
zero. We thus assume positive prior probability for 
vanishing wavelet coefficients. 

Let 7 = (71,..., 7;) denote the vector of indices 
of nonzero wavelet coefficients, that is, dj^ = iff 
[jk) ^7. We define a prior distribution on 7 with ge- 
ometrically decreasing probability for nonzero wave- 
let coefficients in higher levels of detail j: 



Pr((i,fc = 0) = 1 



w 



See, for example, Abramovich, Sapatinas and Silver- 
man (1998) for a discussion of the choice of a. 

We write 6^/ for the subvector of nonzero wavelet 
coefficients djfc, and we use 7 = 1 for the full model 
which includes all coefficients 7 = {{jk),j = Jq, . . . , J 
and A; = 0,...,2-^ — 1). The prior p{6^\^) for the wave- 
let coefficients under model 7 is implied from (2) 
by conditioning the multivariate normal on 6^ = 0, 
h^j. Let Q = V~^ and write ^m for the submatrix 
with rows and columns (71, ... ,7;). Then 



(3) 



pi0^h) = NiO,Xn-^])=N{O,XA). 



We use A to generically denote ^i^\, suppressing the 
dependence on 7 to simplify notation. 

3.3 Illustration of the Shrinkage Effects 

Figures 2 and 3 demonstrate the "shrinkage to- 
ward smoothness" behavior of the priors in Sec- 
tions 3.1 and 3.2. The figures give realizations from 
the priors specified in the two subsections. Figure 2 
utilizes /3 = 0.1 from the prior in Section 3.1 and Fig- 
ure 3 utilizes /3 = 0.9. The smaller /3 induces much 
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Fig. 3. Prior simulations as in Figure 2, but using fi — Q.Q (very little dependence). 



more dependence, clearly resulting in smoother func- 
tions. 

The left panel of each figure is generated from 
use of only the prior in Section 3.1, that is, all the 
wavelet coefficients are kept. In contrast, the right 
panels of each figure show what happens when many 
of the wavelet coefficients are set to zero. (For sim- 
plicity, these were produced using a standard wavelet 
thresholding rule.) Clearly, setting many wavelet co- 
efficients to zero does seem to result in considerable 
additional shrinkage toward smoothness. 

4. POSTERIOR SIMULATION 

We implement posterior inference using Markov 
chain Monte Carlo simulation. Marginalizing over 6^, 
we use the posterior probabilities p(7|y) to define 
a Metropolis-Hastings scheme which proposes moves 
in the model space by adding or deleting one wavelet 
basis function at a time. The computational effort of 
the proposed scheme is comparable to that of George 
and McCulloch (1997) and Smith and Kohn (1996), 
who suggest schemes based on algorithms by Cham- 
bers (1971) and (1979) which allow fast updating of 
a Choleski decomposition of the cross-product ma- 
trix X'X. The algorithms proposed by George and 
McCulloch (1997) and Smith and Kohn (1996) al- 
low computation of marginal posterior probabilities 
with O(g^) basic operations, where q is the num- 
ber of covariates (basis functions) included in the 
model. We describe a similar efficient updating al- 
gorithm in a form suitable for the wavelet regression 
problem. 

Notation. Let Aij be the element in the ith row 
and jth column of a matrix A, with Ai being its 



ith column vector. For a vector 7 = (71, . . . ,7^) we 
denote with A^ the submatrix consisting of columns 
(71,..., 7i), with A(^) the submatrix consisting of 
columns and rows (71,..., 7^), and with ^(--y) the 
submatrix with rows and columns 7= (71,..., 7^) 
removed. 

Let Xi,yi, i = 1, . . . ,N, denote the observed data. 
Let /i = 1, . . . , 2"^ index the wavelet coefficients d = 
{cjokidjk) and let X denote the design matrix 

^jk{xi) for /i = 2-^« -h 1, . . . , n. 



X, 



ih 



4>Jok{xi) for /i=l,...,2'^o. 



where {jk) are the wavelet indices corresponding to 
the hth element in the vector d of wavelet coeffi- 
cients. 

Likelihood. For a given model 7 the wavelet de- 
composition of the unknown velocity curve / implies 
a likelihood 



(4) 



2/,|e,7 ~W(X^0^,5), 



1 N, 



where S = diag((T?) with known variances af,i = 
1,...,N. 

Posterior. Together with prior (3) the likelihood 
implies a multivariate normal posterior piO-ylyjj) = 
N{fj., S) with 

S-i = {X^yS-^X^ +1/Xn(^) and 



^ V ' 

vt 

Again, to simplify notation we suppress the depen- 
dence on 7 in ;U and S. 
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4.1 Down Move 

Assume 7 = (71 , . . . , 7;) and consider a move "down" 
to the submodel 7* = (71, . . . , 7i-i)- Partition S into 






^11 



and similarly fj. = {n(^_i^,fii). Then 
with S* = S 



S/S-^S! and n* 



(_/) - z.iz.11 lui ana ji = ^^_i) + 
milarly, A* = A(_;) - AA;7^A;. 
The corresponding ratio of marginal probabilities 



SS;7i(-w)- Similarly, A* = A(_;) - AA^^A;. 



IS 



pjyh*) 
p{yb) 






S/, 



(l/2)/.2/E„ 



This expression is easily verified using the candidate 
formula p(y|7) = p{0j\'y)p{y\9y,-f)/p{6y\y,-f) and 
substituting 9^ = 0. 

4.2 Up Move 

Consider a move from 7 to 7* = (71,7). Denote 
with (;U, E) and A the posterior and prior moments 
under the (current) model 7: 

p(e^|7,y) = iV(/i,S) and p(^^|7) = iV(0, AA). 

Similarly, let (//*,S*) and A* denote the posterior 
and prior moments under the (proposed) model 7*: 

p{9^*h*,y) = Nifi*,J:*) and 

p{9^,\j*) = N{0,X^*)- 

For posterior simulation we use a lower triangular 
Choleski decomposition of the posterior variance/ 
covariance matrix, TT' = S and T*'T* = S*. The 
new moments /i*,S* and A* and the Choleski de- 
composition T* are computed using the following 
expressions. 

Let Q* = {x^'ys-^x^',n* = f](^,),g = {x^y ■ 

5-1^7 and = il(^\ and partition 



Q* 



Qli Qi 
Ql Q 



and Q* 



O* 
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Of 



Let b = Ql + l/Xni ,h = T,b,c = Ql^ + l/AOJ^ , 60 
Ql, ho = AQl and cq = ^li- Then 





S 



+ 



1 



b'h 



1 -h' 
-h hh' 



and 



A* 



^ 




A 



Co - b'^ho 



1 



-K 



-ho /lo/ln 



+ (c-6'/i)S*S*'7;(^*), 



and T* is obtained by augmenting T with a new 
first column w = T^l/ ^y^fl to 



w T 

The corresponding ratio of marginal probabilities is, 
by symmetry to the down move, 

p{y\r) V s* J 

5. EXAMPLE 

We apply the above methodology to the data for 
the star T Moncerotis, as shown in Figure 1, for the 
choices /3 = 0.1 (strong dependence of the di) and 
a = 0.5 (inducing a moderate level of sparsity). The 
resulting nonparametric posterior is difficult to sum- 
marize; some features of this posterior are presented 
in Figure 4(a). 

It is, of course, one of the strengths of the Bayesian 
approach to shrinkage that uncertainty in the shrink- 
age estimate [the posterior mean of f{x), given by 
the thick center line in Figure 4(a)] can also be given. 
This is crucial in characterizing the (considerable) 
uncertainty in the eventual estimate of distance to 
the star (see (Barnes et al., 2003)). 

Figure 4 also indicates the effect on the T Mon- 
cerotis data of each of the shrinkage priors in Sec- 
tions 3.1 and 3.2. Panel (b) shows the effect of the 
prior in Section 3.1; setting /? = 0.9 effectively makes 
the di independent. Panel (c) shows the effect of the 
prior in Section 3.2; setting q = 0.7 greatly decreases 
the number of wavelet coefficients set to zero. In 
both cases, the posterior functions appear to be un- 
reasonably rough and the uncertainty in the shrink- 
age estimate appears to be unreasonably large. Pa- 
nel (d) , which effectively uses neither of the shrink- 
age techniques, is especially unsatisfactory. 
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Fig. 4. Posterior inference for T Moncerotis. In all four panels, the thick smooth line shows the posterior mean curve. The 
gray shaded margins show central 50% (light gray) and central 90% (dark gray) intervals. The points are the observed data 
points, with little error bars showing 2 standard deviations for the measurement error. Panel (a) shows inference under /? = 0.1 
and a = 0.5. Panels (b) through (d) show posterior inference using (3 = 0.9 (h and d) and a = 0.7 (c and A). Fixing (3 = 0.9 
essentially assumes independence of the di and implies less smoothing; setting a = 0.7 greatly decreases the number of wavelet 
coefficients set to zero. 
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